Identification of biomarkers predictive of metastasis development in early-stage colorectal cancer using network-based regularization

Colorectal cancer (CRC) is the third most common cancer and the second most deathly worldwide. It is a very heterogeneous disease that can develop via distinct pathways where metastasis is the primary cause of death. Therefore, it is crucial to understand the molecular mechanisms underlying metastasis. RNA-sequencing is an essential tool used for studying the transcriptional landscape. However, the high-dimensionality of gene expression data makes selecting novel metastatic biomarkers problematic. To distinguish early-stage CRC patients at risk of developing metastasis from those that are not, three types of binary classification approaches were used: (1) classification methods (decision trees, linear and radial kernel support vector machines, logistic regression, and random forest) using differentially expressed genes (DEGs) as input features; (2) regularized logistic regression based on the Elastic Net penalty and the proposed iTwiner—a network-based regularizer accounting for gene correlation information; and (3) classification methods based on the genes pre-selected using regularized logistic regression. Classifiers using the DEGs as features showed similar results, with random forest showing the highest accuracy. Using regularized logistic regression on the full dataset yielded no improvement in the methods’ accuracy. Further classification using the pre-selected genes found by different penalty factors, instead of the DEGs, significantly improved the accuracy of the binary classifiers. Moreover, the use of network-based correlation information (iTwiner) for gene selection produced the best classification results and the identification of more stable and robust gene sets. Some are known to be tumor suppressor genes (OPCML-IT2), to be related to resistance to cancer therapies (RAC1P3), or to be involved in several cancer processes such as genome stability (XRCC6P2), tumor growth and metastasis (MIR602) and regulation of gene transcription (NME2P2). We show that the classification of CRC patients based on pre-selected features by regularized logistic regression is a valuable alternative to using DEGs, significantly increasing the models’ predictive performance. Moreover, the use of correlation-based penalization for biomarker selection stands as a promising strategy for predicting patients’ groups based on RNA-seq data. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-05104-z.


Introduction
Colorectal cancer (CRC) is one of the leading causes of cancer-related deaths worldwide. In 2018, it was the third most common cancer, with around 1.8 million new cases and the second most deathly cancer with almost 900 thousand deaths (9% of all cancerrelated deaths) [1]. CRC begins as a benign adenomatous polyp, which develops into an advanced adenoma with high grade dysplasia and then progresses to invasive cancer. Invasive cancers that are confined within the wall of the colon (stages I and II) are curable. However, if untreated, they may spread to regional lymph nodes (stage III) or later metastasize to distant sites (stage IV) [2].
CRC is a very heterogeneous disease that can develop via distinct pathways involving different combinations of genetic and epigenetic changes [3]. These genetic differences between patients may lead to differences in susceptibility where cancers deriving from the same tissue may be stratified into disease subtypes [4]. Genetic and epigenetic heterogeneity poses a problem for the diagnosis and therapy of cancer. For example, it can lead to incorrect treatment decisions. CRC has three main types known, divided by their origin and expression: sporadic form (60%-80% of the cases), family type (20%-40%) and hereditary type [5]. Sporadic CRC may appear in individuals who carry no mutation that makes them susceptible to developing this type of cancer. Regarding the family type, no gene has been found to be related to the disease. However, there is a higher chance of developing this tumor when family members have suffered from sporadic colon cancer. In these cases, environmental factors play a critical role. Hereditary type may be divided into two subtypes whether patients show adenomatous polyps -familial adenomatous polyposis (FAP), or not -hereditary nonpolyposis colorectal cancer [5].
Metastasis is the major cause of death in CRC patients, and approximately 20% of the patients already have metastases at diagnosis [6]. In this context, it is vital to diagnose CRC at an early-stage and accurately identify patients likely to progress to metastasis in order to improve CRC patients outcomes. Tumor surgical removal is the treatment of choice for early localized CRC disease (stage II-III) [7]. 50% of stage III patients are cured by surgery, whereas 20% of patients will survive due to the addition of adjuvant chemotherapy and 30% will relapse in 2-3 years [8]. Altogether, only 20% of stage III patients benefit from chemotherapy, exposing 80% of patients to unnecessary toxicity [9]. Therefore, one of the main challenges is to identify those stage II-III CRC patients where adjuvant chemotherapy is crucial to improve their outcomes.
Many studies try to understand tumor biology and mechanisms that lead to metastasis; notwithstanding, the identification of the factors influencing metastatic tumor cells, especially in colorectal cancer, remains poor [10]. Consequently, over the years, there was an increase in molecular profiling of tumors using next-generation sequencing (NGS), such as RNA sequencing (RNA-seq), which constitutes an important tool widely used in cancer research for studying the transcriptional landscape and molecular pathways [11].
Supervised learning comes as a natural choice for helping in the classification of patients into metastatic and non-metastatic, based on NGS data. Some of the widely used classifiers applied to RNA-seq data are Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), and Support Vector Machine (SVM) [12][13][14]. However, despite the invaluable information provided by NGS, the intrinsic high dimensionality of gene expression data may compromise the classification learning task and severally hamper an accurate selection of biomarkers. Therefore, feature selection plays a pivotal role in the selection of informative genes preceding the classification of RNA-Seq data for disease prediction and diagnosis, to enhance accuracy in disease classification [15]. Furthermore, ranking of the features according to their relevance to the classification problem and further selection of the best ones can improve the performance of the prediction model [13].
Another common way to address the data high-dimensionality challenge is to use classification algorithms that control the model's complexity through regularization [16,17]. One option is to regularize the log-likelihood function of the LR model. Two of the most commonly used penalties are the lasso ( ℓ 1 -norm) and the ridge ( ℓ 2 -norm) [14] regularizers, whose combination leads to the Elastic Net [18]. Network analysis has also shown enormous potential in precision medicine, helping to identify key biomarkers and therapeutic targets in cancer [19]. Several studies used network-based regularizers to improve model accuracy and interpretability. Prior network knowledge may be based on proteinprotein interactions [20], or from the correlation matrix of the gene expression values [21,22].
In sum, several studies demonstrated that using supervised learning methods in microarray gene expression data [23] is a very promising technique and that the integration of gene expression profiles with network information may help to identify markers correlated with metastasis [24]. Also, in the context of colorectal cancer, some classifiers developed to investigate metastasis were based only on clinical data (e.g., sex, age at diagnosis, histological subtype, stage, primary site) [25]. Therefore, there is still an urgent need of methods for the identification of factors influencing metastatic tumor cells, especially in colorectal cancer.
In this work, we try to find a set of biomarkers that may predict the risk of metastasis using transcriptomic data from a cohort of CRC patients followed at the Hospital de Santa Maria (Lisbon), one of the largest hospitals in Portugal.
To achieve this goal, we applied and tested different classification methods using transcriptomic data, and proposed a new combined model that showed higher classification accuracy compared to its model counterparts. Altogether, we proposed a new pipeline for the selection of putative biomarkers based on patients' gene correlation matrices.

Materials and methods
To identify important genes involved in the CRC metastasis process, several classification methods applied to RNA-seq data were tested. The pipeline of this study is represented in Fig. 1. All the methods were implemented in the R statistical software [26] and the corresponding code is available at https:// github. com/ sysbi omed/ iTwin er. git.

Datasets
Primary tumor samples from patients diagnosed with CRC disease from June 2010 to October 2017 were collected as part of a prospective biobanking project approved by the Ethical Committee of Hospital de Santa Maria, all procedures were performed in accordance with relevant guidelines. Patients were followed at the Oncology Division of Hospital Santa Maria, Lisbon, and were treated as per institutional clinical practice in accordance with international guidelines, namely ESMO and NCCN guidelines. Cases were staged according to The American Joint Committee on Cancer (AJCC) staging system, 8th edition, and patients had not received neoadjuvant chemo or radiotherapy prior to sample collection. Whole transcriptome sequencing (WTS) was performed by Illumina Inc.
The dataset used in this study comprises 110 samples from early-stage (II and III) CRC patients with both clinical and transcriptomic (RNA-seq) data. This was obtained from two different cohorts of CRC patients from Hospital Santa Maria (Lisbon, Portugal): Cohort 1: Cohort described in [27] containing 111 samples, available under accession number EGAS00001005276 (European Genome-Phenome Archive). This cohort has 26 samples from primary stage II-III colorectal tumors that did not metastasize, 34 primary stage II-III colorectal tumors that metastasize in three years of follow up, 12 adjacent normal colonic mucosa, and 39 metastasis of CRC patients. From this cohort, only the primary colorectal tumors samples were used ( n T 1 = 45 ), from early-stage CRC that metastasize ( n PM1 = 19 ), and did not metastasize ( n P1 = 26 ). Cohort 2: Cohort described in [28] containing 114 samples, already available in NCBI Database under accession number PRJNA689313. We used n T 2 = 65 samples that correspond to earlystage CRC that metastasize ( n PM2 = 11 ) and early-stage CRC that did not metastasize ( n P2 = 54).
The clinical dataset descriptive statistics are summarized in Table 1. The sex (Female or Male), tissue of cancer primary site (Colon or Rectum), stage of the disease (II or III), sidedness of primary site (Right or Left side of the colon), and age variables were selected for further analysis. For the classification methods, two groups of interest were selected, early-stage (II-III) patients that do not metastasize (P, n P1 + n P2 = 80 ) and early-stage patients that metastasize (PM, n PM1 + n PM2 = 30 ) during the follow-up time period. Given the resulting imbalanced groups and the problems in classification that were obtained due to class imbalance, an undersampling strategy was taken for model training by splitting the initial dataset into three different smaller datasets, i.e., DATA-SET1 ( n = 60 ), DATASET2 ( n = 55 ), and DATASET3 ( n = 55 ). For each dataset, PM patients were the same ( n = 30 ) and P patients were randomly divided into three groups ( n 1 = 30 , n 2 = 25 , n 3 = 25 ). With this strategy, we exploit all the data collected while keeping class balance in each classification procedure. Other data partitions may be tested using the available code.
The original gene expression dataset was comprised of 39,103 variables (genes). After excluding the genes with a constant expression (standard deviation of zero), a dataset with 37,504 variables (genes) was obtained.
A preliminary study of the datasets was performed to verify the statistical significance of the differences between factor variables across the P and PM groups of patients using the Fisher's Exact test, namely to the variables sex, tissue type, stage of the disease, and sidedness. For the age variable, a t-test was used to compare the mean between the two groups. Subsequent survival analysis was performed for each dataset used, with the main goal of studying the time until an event of interest, i.e., death, occurs [29]. Here, we compared the differences in survival between several groups of interest -namely, stages of the disease (II vs. III), sidedness (Right vs. Left side of the colon), and class (P vs. PM) -using the log-rank test [30].
Finally, differential gene expression analysis was performed to identify genes differentially expressed between the two patient groups (P and PM). To perform this analysis, the edgeR R software package was used, employing an FDR (false discovery rate) cut-off of 0.05 to identify differentially expressed genes (DEGs). These genes were further used for classification.

Classification methods
Classification is a supervised learning method, where the model learns from a set of predefined samples with given class labels (training dataset). The knowledge inferred from this is applied to classify unknown samples (a test dataset) accordingly [13].
In this work, three binary classification approaches were used to distinguish earlystage CRC patients that metastasize from those that did not: 1) classification methods based on a subset of relevant genes (DEGs), 2) classification via regularized logistic regression with embedded feature selection applied to the full dataset, and 3) all classifiers based on the relevant features identified by regularized logistic regression (Fig. 1).

Binary classification
Regarding binary classification, five different classifiers were tested: decision trees, support vector machines (linear and radial), logistic regression, and random forest. One of the limitations of these methods emerges when using high-dimensional data. Since a high number of features may lead to problems in classification analysis, a smaller subset containing only genes found to be differentially expressed between the groups P and PM was used. The difference in the expression level of genes is found useful in classification in order to identify disease biomarkers [13]. Decision trees (DT) are one of the most used classifiers. The tree complexity, measured by the number of nodes and number of features used, has a crucial effect on its accuracy [31]. Certain parameters for tree construction were fixed as explained in the section "Method evaluation and comparison". SVMs have been successfully applied to a wide variety of biological applications, such as the classification of microarray gene expression profiles. Here we tested both linear (svmL) and radial (svmR) kernel functions [32]. Logistic regression allows the analysis of binary outcomes using a logistic function [33]. This method will be explained in more detail below. Finally, Random forest (RF) is an ensemble learning method for classification, operating by constructing a multitude of decision trees [34] to get a more accurate and stable prediction. These classification procedures were performed using the R software caret package.

Regularized logistic regression
Another approach that has been widely used for classification problems in cancer is logistic regression [35,36]. This method is used for modeling a binary response variable [37]. In this specific case, we investigated how metastasis may be predicted using gene expression levels from early-stage CRC patients. The logistic regression model estimates the probability of belonging to a given class ( Y i = 1 ) by: where X i , i = 1, . . . , n , is the vector of the p covariates (gene expression values) of the ith patient, and β = (β 1 , β 2 , . . . , β p ) are the corresponding regression coefficients. Fig. 1 Methodological procedure of the work presented here. The full dataset was divided into three smaller datasets. Survival analysis was performed to each dataset to evaluate how stages of the disease (II vs. III), sidedness of primary tumor site in colon (Right vs. Left), and class (P-primary patients that do not metastasize vs. PM-primary patients that metastasize) are related to risk of death. Afterwards, three different approaches to classify early-stage patients that metastasize were used: (1) Classifiers without regularization (DT -decision trees, svmL-linear support vector machine, svmR-radial support vector machine, LRlogistic regression and RF-random forest) applied to subset of genes that were found differentially expressed between two groups (P vs. PM); (2) Regularized logistic regression performed on the full dataset using two different penalization factors (EN-elastic net, and iTwiner); (3) Classifiers applied to genes pre-selected by regularized logistic regression. Model performance was compared using different types of measures (e.g., accuracy and misclassifications) The parameters β of the logistic model are estimated by maximizing the log-likelihood function, given by where the binary variable y i indicates to which group observation i belongs to, either a patient known to have metastasized in the future (group PM, y i = 0 ) or to a patient whose tumor did not metastasize (group P, y i = 1).
One of the most used techniques to handle high-dimensional gene expression data is regularization [38]. The most common regularizer is the Elastic Net ( [18]), which combines the ℓ 1 -norm and the squared ℓ 2 -norm of the parameters: where 0 ≤ α ≤ 1 . When α = 1 , the least absolute shrinkage and selection operator (Lasso) is obtained, whereas α = 0 corresponds to the Ridge regression. Lasso may set coefficients to zero, resulting in a sparse model with fewer coefficients. Ridge regression, on the other hand, is a continuous shrinkage method that minimizes the residual sum of squares, keeping all the predictors in the model [39]. The parameter that controls the penalizing weight is usually chosen with cross-validation.
Incorporating network-based regularizers in classifiers may improve model interpretability leading to parameter estimation towards meaningful biological solutions. This network information may be obtained from the data correlation itself. For example, Twiner was recently proposed as a regularizer based on pairwise correlations between the features in two distinct groups A and B [21]. This method allows the selection of similarly correlated genes in two groups (e.g., in two given diseases). Here we propose a variant of Twiner, the iTwiner, in which the more different a gene's correlation pattern is between two groups (metastatic and non-metastatic), the less penalized will be in the regularization term of logistic regression.
Given two correlation matrices for A and B, , respectively, where each column σ j ∈ R p represents the correlation of each feature j = 1, . . . , p with the remaining ones, the dissimilarity measure d j (A, B) of feature j between A and B is given by the angle of the corresponding vectors The regularizer is constructed using these distances, to promote the selection of genes whose correlation patterns are more distant between A and B. The penalty term is given by The iTwiner adapts the former regularization in order to penalize, now in an inverse way, the gene expression correlation similarities between the two groups (P vs. PM). The main rationale, in this context, is to select biomarker signatures that indeed reflect the different correlation patterns between the metastatic vs. non-metastatic early-stage CRC patients.

Method evaluation and comparison
In this work we tested three different approaches to find the best CRC metastasis classifier: 1) Classifiers based on DEGs; 2) Regularized logistic regression applied to the full dataset; 3) Classifiers based on genes pre-selected from regularization (instead of DEGs).
As explained above, the original dataset was split into three smaller datasets due to the existing class imbalance. For each dataset used, samples were randomly divided into a training set (for model construction) and a test set (for model evaluation), comprising 70% and 30% of the data, respectively. To obtain statistically reliable predictive measurements, 10-fold cross-validation was performed on the training set to optimize the parameter in regularized logistic regression. Regarding decision trees, the minimum number of observations that must exist in a node in order for a split to be attempted and the minimum number of observations in the final node were fixed (minsplit = 4; minbucket = minsplit/3, respectively). After testing manually some values, these were the ones that gave the best estimated tree. Also, a 10-fold cross-validation was used across all runs to tune maxdepth and estimate the best tree, guaranteeing models' comparison. This estimation procedure and hyper-parameter optimization was performed using the R software package rpart. For support vector machine, random forest and logistic regression classifiers, the train function from caret package was used to perform hyper-parameter optimization from a training set using the default 10-fold cross-validation. To mitigate the variability of these procedures, train and test sets were randomly generated 100 times, keeping the same fixed split (70%-30%).
For the EN model the parameter that controls sparsity was set to α = 0.2 and for iTwiner α = 0.05 , which selected an adequate number of variables to be further analyzed and interpreted. Notwithstanding, different α parameters may be tested to select different gene set sizes, using the code made available.
To evaluate the models' performance, depending on the class predicted by the classifier and the true class of the patient (non-metastatic -P or metastatic -PM), four different results can be obtained: True positive (TP) -patient predicted as positive (nonmetastatic) and the patient was non-metastatic; False positive (FP) -patient predicted as positive (non-metastatic) but the patient did metastasize; True negative (TN) -patient predicted as negative (metastatic) and the patient metastasized; False negative (FN) -patient predicted as negative (metastatic) but the patient did not metastasize. Using these results, the following measures on the test set were used as indicators of the performance of the classifiers: Accuracy (fraction of correct predictions -Acc), number of misclassifications (Miscl), Sensitivity (fraction of actual positive cases), Specificity (fraction of actual negative cases) and AUC (area under the ROC curve). The median values of all performance indexes obtained for train and test sets across the 100 runs were used for comparison.
To perform the analysis described above, glmnet [40] package was used in R statistical software. The q vector was introduced as a penalty factor in the glmnet function.

Results and discussion
Different gene expression profiles are expected in early-stage patients that will metastasize compared to non-metastatic patients, as a consequence of molecular, biochemical, and genetic variations that make metastatic cells able to migrate from the primary tumor to other body sites [41]. In this work, several classification and feature selection strategies based on RNA-seq data were evaluated to distinguish early-stage (II-III) CRC patients that metastasize from those that do not, and to find a subset of genes that may be predictive of CRC metastasis.

Exploratory analysis
The data used to perform this analysis is described in Table 1. As explained before, the full dataset was divided into three. These were analyzed individually and patients were divided into several groups regarding important clinical factors such as sex (Female and Male), tissue (Colon and Rectum), stages of the disease (II and III), sidedness (Right, Left), and age. The statistical significance of the differences between groups P and PM for each clinical factor can be found in Table 1. Most clinical factors yielded no significance in the differences between groups. This is an important step to guarantee that further differences found between P and PM groups are related to gene expression data and not to possible clinical confounding factors.
Afterward, to assess if there were differences in the survival probability regarding clinical factors, survival analysis was performed (Fig. 2) for each dataset used, and the significance of the differences was determined via the log-rank test. As expected, stage III of the disease (more advanced stage), was related to a higher risk of death compared to stage II. This was observed in all datasets, however, only DATASET1 had significant results (p value = 0.0091). Also, PM patients showed worst survival probability when compared to P patients, significant in all datasets (p value < 0.001). Finally, regarding sidedness, no statistically significant results were found. However, for DATASET1 and DATASET2, there was a tendency for the right side to be related with worst survival probability, as shown in the literature [42].
Differential gene expression analysis was performed in all datasets to find differential expressed genes (DEGs) between P and PM patient groups. In DATASET1, a total of 9533 DEGs were found. Among those, 1589 were up-regulated and 7944 downregulated in PM patients. In DATASET2, 1840 DEGs were found, 835 up-regulated and 1005 down-regulated in PM. Finally, 138 DEGs were found in DATASET3, 39 up-regulated and 99 down-regulated. Given the high number of DEGs found in each dataset, a smaller gene set containing only the fifty DEGs that exhibited the lowest p-values between the two tissues was created, for ease of model building and interpretation. The list of ranked genes can be found in Table 2. To compare if there were DEGs found in common between datasets, a Venn diagram was constructed (Fig. 3).
The DEGs found in common between datasets are represented in Table 3, where log fold change (LogFC) is also shown. Negative values refer to down-regulated genes and positive values to up-regulated genes in PM patients. As we can see, twelve genes were considered DEGs in at least two datasets between tissues of early-stage metastatic patients and non-metastatic, with 3 DEGs in common between all of the datasets tested, GBP4, IDO1, IGHV4-34. Interestingly, all these three genes have important implications in immune regulation, highly relevant for cancer progression [43][44][45]. Several of the other genes identified have previously been involved in cancer cells migration, invasion and metastasis such as LRP4, LGR6, APOL1 and CXCL11 [46][47][48][49].

Classification based on the DEGs
To classify primary patients that metastasize, five distinct classification methods were used: decision trees (DT), random forest (RF), linear and radial support vector machine (svmL and svmR, respectively), and logistic regression (LR). Due to high-dimensionality problems, the full gene expression dataset cannot be directly used. Therefore, we decide to perform feature selection using only DEGs found between early-stage patients that metastasize and those that do not metastasize. This is a common approach used to reduce feature dimension before classification. Since the number of DEGs found in each dataset was different, we used the 50 DEGs with the lowest p-value described above as means to use the same gene dataset dimension as input to all classifiers in all datasets tested. After training the classifiers 100 times, several performance evaluation metrics were calculated in the test set, such as accuracy, misclassifications, sensitivity, specificity, and area under the ROC curve (AUC). The median results of all runs obtained for each dataset in the test set are displayed on Table 4 (all performances for train and test sets may be found in the Additional file 1: Table S1). Also, pairwise comparisons of the accuracy obtained for classifiers may be found on Additional file 1: Table S2. It is shown that the results are similar between the different methods tested. Nonetheless, RF was the best classifier obtained, presenting higher accuracy (0.72, 0.71, and 0.71) and AUC (0.72, 0.71, and 0.69) in all datasets tested, and the lowest number of misclassifications (5) in the test set.

Regularized logistic regression
The second approach tested to distinguish early-stage CRC patients that metastasize from those that do not, was to use regularized LR with different types of penalization for feature selection: Elastic net (EN) and the correlation-based regularizer iTwiner. The test set results obtained for these methods applied to the full dataset are described in Table 5 (results for train and test sets may be found in the Additional file 1: Table S3).
The performance of these methods was similar to the classifiers tested above, where higher accuracy in test set was obtained by the iTwiner method (mean Acc = 0.69 ). Interestingly, most of the misclassifications in DATASET2 and DATASET3 using both approaches were false negatives (FN), meaning that these methods classified wrongly patients that do not metastasize in patients that metastasize. Since non-metastatic patients can indeed metastasize in the future it would be of great value to do a follow-up on these patients that were labeled wrongly by these methods.
The median number of selected variables (genes) by the two methods (across the 100 runs) used to separate the two groups (P vs. PM) was 48 for EN and 38 for iTwiner Table 4 Median classifiers performance results (and standard deviation in parenthesis) obtained for test sets for the 100 runs tested using five classification methods applied to the fifty DEGs with lowest p-value DT (Table 5). Also, the number of genes selected in at least 50% of the 100 runs tested in EN was smaller when compared to iTwiner, indicating that the iTwiner method is more stable since more genes are consistently selected as important for the classification of early-stage patients that metastasize. Moreover, to assess which genes are being recurrently selected by the methods, independently of the dataset used, a Venn diagram was constructed (Fig. 4). Here, we compared for each regularizer (EN and iTwiner) the top fifty genes selected in each dataset (Table 6), as the most likely genes to be metastatic biomarkers in CRC patients. Interestingly, only a few biomarkers were found to be DEGs between the P and PM groups, represented in bold the downand in underline the up-regulated genes in the PM group. Also, we can see that using   iTwiner, a higher number of genes is selected in common across the three datasets tested, which once more stands as evidence of improved stability and robustness of the selected feature sets, irrespective of the dataset used. Looking closely at the genes selected by each classifier in the different datasets tested (Table 6), EN only selected RAC1P3 in common to all datasets. This gene is a pseudogene of the Rac family of small GTPase whose role in cancer is still unknown. Regarding iTwiner, six genes were selected by the three datasets tested, RAC1P3, XRCC6P2, EEF1B2P6, HSPD1P7, TRBV11-1, HORMAD2. The majority of these genes are pseudogenes with an unknown role in cancer. However, HORMAD2 has been reported to have tumor suppressor functions, and its expression was seen down-regulated in cancer [50].
Here we showed that HORMAD2 gene was down-regulated in early-stage patients that metastasize (represented in bold in Table 6).

Classification based on regularized-selected genes
The final procedure to find a set of biomarkers involved in metastasis processes of earlystage CRC patients was to use classification methods based on the features previously identified by regularized LR. In particular, using the genes pre-selected by regularization (EN, iTwiner) as an alternative to the DEGs (Fig. 1, method 3), to try to improve the classification performance.
Here, the five classifiers used earlier (DT, svmL, svmR, LR, and RF) were applied to the two smaller gene sets obtained by regularized LR. To have the same dataset dimension as before, the fifty genes selected by EN and iTwiner penalties ranked in Table 6 were used as input to the classifiers. This was done to each dataset as previously. Performances obtained for classifiers train and test sets using genes pre-selected by EN and iTwiner penalties may be found on the Additional file 1: Tables S4 and S6, respectively. Pairwise comparisons of the accuracy obtained for classifiers may be found on Additional file 1: Tables S5 and S7. Regarding classifiers applied to gene sets based on EN penalties (Table 7), for all datasets tested, the best results in test set were obtained using svmR ( Acc = 0.78, 0.76, 0.76 ) and RF ( Acc = 0.78, 0.76, 0.76 ) methods. Also, in the RF method, the specificity of the results was higher, i.e., most of the misclassifications were FN. This means that the classifier labeled patients as metastatic even though they were non-metastatic at the three years follow-up time.
Afterward, we tested the same classifiers applied to a different gene set based on iTwiner penalization ( Table 8). The best accuracy was obtained by RF classifier as before ( Acc = 0.86, 0.82, 0.76 ). However, using this iTwiner penalization improved the specificity of the classifier (Specificity = 1 for all datasets). Table 9 presents the mean performance results for all the tested combinations of classifiers and feature selection methods: DEGs found between P and PM patient group (Table 9, DEG +), genes pre-selected by EN regularizer (Table 9, EN +) and genes preselected by the iTwiner (Table 9, iTwiner +). For all gene selection methods tested, the best performance classifier was RF showing the highest accuracy and specificity. Overall, using the genes found by regularization, considering different penalty vectors (and so different information used for selection), instead of using DEGs found between groups, improved in a significant way the accuracy of the classifiers (Table 9). A pairwise  comparison using Wilcoxon rank sum test with Benjamini & Hochberg p-value correction was performed to assess the statistically significant differences between the groups (Additional file 1: Table S8).
Moreover, for most classifiers tested (DT, svmL, and RF), if the selection of genes is based on correlation matrices (iTwiner), the performance of the models increases significantly, leading to the most accurate results. To better visualize these, Fig. 5 shows boxplots of the classifiers' accuracy obtained for all gene selection methods applied to each dataset tested. Overall, when gene sets were obtained by regularization (EN + and iTwiner +), higher accuracy was obtained. This is well observed in the RF classifier (Fig. 5d).

Conclusions
CRC is one of the leading causes of cancer-related deaths worldwide, being metastasis the major cause in these patients. Therefore, it is crucial to accurately diagnose CRC at an early-stage and understand the molecular mechanisms underlying metastasis. Several studies have tried to understand tumor biology and metastasis mechanisms by Table 8 Median values (and standard deviation in parenthesis) obtained in test set for the five classification methods using fifty most frequently genes pre-selected by LR with iTwiner penalization DT-decision trees; svmL-linear support vector machine; svmR-radial support vector machine; LR-logistic regression; RF-random forest; D1-DATASET1; D2-DATASET2; D3-DATASET3; x-datasets mean; Acc-accuracy; Misclmisclassifications; FN-false negatives; Sensitivity-fraction of actual positive cases (P); Specificity-fraction of actual negative cases (PM); AUC-area under the ROC curve comparing early-stage versus metastatic tumors. We explore the relevance of studying early-stage (II-III) tumors that do not metastasize versus those that metastasize, in three years of follow-up. However, this is not an easy task since the high-dimensionality of gene expression data leads to problems in classification methods. As such, feature selection methods are important for selecting informative genes prior to classification, to improve their accuracy.
Here we present two major contributions to the discovery of metastatic biomarkers in CRC based on classification and feature selection. The first contribution is a new network-based feature selection method, iTwiner, that promotes the selection of genes with distinct correlation patterns in metastatic and non-metastatic patients, and has shown to significantly increase the classifiers' predictive performance. Moreover, the proposed iTwiner regularizer selected the most stable and robust gene sets, including tumor suppressor genes and genes involved in several cancer processes like tumor growth and metastasis.
The second contribution proposes using gene sets pre-selected by regularized LR (via EN and iTwiner) as input features in the classification learning task, with proven improved performance compared to using DEGs as features, across many datasets and classifiers tested. Correlation-based penalization via the iTwiner penalty selected the best gene set for accurately distinguishing the two groups of patients, placing iTwiner as a promising strategy in the classification of CRC patients based on RNA-seq data and for the disclosure of biomarkers of CRC metastasis.
As future work, other types of classifiers may be tested, such as Gradient Boosting, Gaussian Process or neural networks, and since different hyper-parameter values may